I plan to use the record (fertility rate) data for the clustering section, which can be found here. The fertility rate dataset contains 10 columns and 5735 rows; however, not all features will be used for the clustering task, as will be shown in the feature selection part. More detailed information can be found in the fertility rate dataset introduction part.
The goal of this section is to use various clustering methods such as K-means, DBSCAN, and hierarchical clustering to group the fertility rate data with no pre-defined labels provided. In this case, if the clustering process results in new labels, it suggests that there might exist unknown or undiscovered groups based on the features of the data provided. Discovering potential unknown clusters can be very helpful for further data exploration and utilization. It is also helpful for identifying underlying patterns and revealing the internal structure of the data.
This part will provide a snapshot as well as the basic information of the fertility rate dataset.
# import relevant packages
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
Here is the snapshot of the fertility rate dataset that will be used for clustering analysis:
#import dataset
df = pd.read_csv("../../data/01-modified-data/FertilityRate_Clean.csv")
df.head()
| | Country_Name | Country_Code | year | FertilityRate | Region | IncomeGroup | GDPperCapita_USD | Human_Dev_Index | Tertiary_school_Enroll_Pctg | label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 1990 | 7.466 | South Asia | Low income | 287.200843 | 0.426556 | 2.211410 | Above Replacement |
| 1 | Afghanistan | AFG | 1991 | 7.479 | South Asia | Low income | 287.200843 | 0.426556 | 2.780193 | Above Replacement |
| 2 | Afghanistan | AFG | 1992 | 7.502 | South Asia | Low income | 287.200843 | 0.426556 | 2.780193 | Above Replacement |
| 3 | Afghanistan | AFG | 1993 | 7.535 | South Asia | Low income | 287.200843 | 0.426556 | 2.780193 | Above Replacement |
| 4 | Afghanistan | AFG | 1994 | 7.572 | South Asia | Low income | 287.200843 | 0.426556 | 2.780193 | Above Replacement |
Below is the basic information of the fertility dataset; it has 5735 rows and 10 features:
print("The shape of this dataframe is:", df.shape, '\n')
df.info()
The shape of this dataframe is: (5735, 10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5735 entries, 0 to 5734
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Country_Name                 5735 non-null   object
 1   Country_Code                 5735 non-null   object
 2   year                         5735 non-null   int64
 3   FertilityRate                5735 non-null   float64
 4   Region                       5735 non-null   object
 5   IncomeGroup                  5735 non-null   object
 6   GDPperCapita_USD             5735 non-null   float64
 7   Human_Dev_Index              5735 non-null   float64
 8   Tertiary_school_Enroll_Pctg  5735 non-null   float64
 9   label                        5735 non-null   object
dtypes: float64(4), int64(1), object(5)
memory usage: 448.2+ KB
As mentioned in the introduction part, not all features will be or should be used for the clustering analysis. For example, columns like Country_Name and Country_Code would not add much value to the clustering process because they act as identifiers rather than features. The label column will also be excluded because it contains the label generated based on the fertility rate of a country/region. Therefore, the updated dataset with the features needed looks like this:
DF = df.drop(['Country_Name', 'Country_Code', 'label'], axis=1)
DF.head()
| | year | FertilityRate | Region | IncomeGroup | GDPperCapita_USD | Human_Dev_Index | Tertiary_school_Enroll_Pctg |
|---|---|---|---|---|---|---|---|
| 0 | 1990 | 7.466 | South Asia | Low income | 287.200843 | 0.426556 | 2.211410 |
| 1 | 1991 | 7.479 | South Asia | Low income | 287.200843 | 0.426556 | 2.780193 |
| 2 | 1992 | 7.502 | South Asia | Low income | 287.200843 | 0.426556 | 2.780193 |
| 3 | 1993 | 7.535 | South Asia | Low income | 287.200843 | 0.426556 | 2.780193 |
| 4 | 1994 | 7.572 | South Asia | Low income | 287.200843 | 0.426556 | 2.780193 |
DF.isna().sum()
year                           0
FertilityRate                  0
Region                         0
IncomeGroup                    0
GDPperCapita_USD               0
Human_Dev_Index                0
Tertiary_school_Enroll_Pctg    0
dtype: int64
Based on the results above, we can see there are no missing values in this dataset.
List the unique values of the categorical variables (Region and IncomeGroup):
print('--------------- Unique Region Values ---------------\n', DF['Region'].unique())
print('\n--------------- Unique Income Group Values ---------------\n', DF['IncomeGroup'].unique())
--------------- Unique Region Values ---------------
 ['South Asia' 'Sub-Saharan Africa' 'Europe & Central Asia'
 'Middle East & North Africa' 'Latin America & Caribbean'
 'East Asia & Pacific' 'North America']

--------------- Unique Income Group Values ---------------
 ['Low income' 'Lower middle income' 'Upper middle income' 'High income'
 'Unclassified']
# plot number of records by regions
sns.set_theme(style="whitegrid")
sns.countplot(x = df['Region'], alpha = 0.9)
plt.xticks(rotation='vertical')
# plot number of records by Income Groups
sns.set_theme(style="whitegrid")
sns.countplot(x = df['IncomeGroup'], alpha = 0.9)
plt.xticks(rotation=45)
sns.scatterplot(data=DF, x="GDPperCapita_USD", y="FertilityRate", hue="Region")
plt.xlabel("GDP per Capita (USD)")
plt.ylabel("Fertility Rate")
plt.title("GDP per Capita (USD) vs. Fertility Rate by Region")
sns.scatterplot(data=DF, x="Human_Dev_Index", y="FertilityRate", hue="Region")
plt.xlabel("Human Development Index")
plt.ylabel("Fertility Rate")
plt.title("Human Development Index vs. Fertility Rate by Region")
sns.scatterplot(data=DF, x="Tertiary_school_Enroll_Pctg", y="FertilityRate", hue="Region")
plt.xlabel("Tertiary School Enrollment Percentage")
plt.ylabel("Fertility Rate")
plt.title("Tertiary School Enrollment Percentage vs. Fertility Rate by Region")
sns.scatterplot(data=DF, x="GDPperCapita_USD", y="FertilityRate", hue="IncomeGroup")
plt.xlabel("GDP per Capita (USD)")
plt.ylabel("Fertility Rate")
plt.title("GDP per Capita (USD) vs. Fertility Rate by Income Group")
sns.scatterplot(data=DF, x="Human_Dev_Index", y="GDPperCapita_USD", hue="Region")
plt.xlabel("Human Development Index")
plt.ylabel("GDP per Capita (USD)")
plt.title("Human Development Index vs. GDP per Capita (USD) by Region")
Above are some of the possible groupings suggested by the available features. We can see that these groups are intertwined and do not have a regular shape or area. Therefore, different kinds of clustering methods should be applied to the data in order to obtain more accurate groups.
Split the dataset into X and y; y will not be used for clustering since this is unsupervised learning. Then normalize X: apply the standard score (x - mu)/sigma to the numeric variables, and replace the categorical values with category codes using the cat.codes accessor.
X = DF.copy()
y = df[['label']]

# encode categorical variables as integer category codes
X['Region'] = X['Region'].astype('category').cat.codes
X['IncomeGroup'] = X['IncomeGroup'].astype('category').cat.codes

# standardize all features with (x - mu) / sigma
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
K-means clustering is one of the most commonly used clustering methods. After randomly initializing k centroids (k is a user-defined number; a centroid is the center point of a cluster), it alternates between assigning each data point to its nearest centroid and recomputing the centroids over numerous iterations, until the assignments no longer change. The final groups are the ones obtained after these iterations.
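The assign/update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up toy data, not the scikit-learn implementation used later in this section:

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=42):
    """Minimal K-means: random init, then alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # initialize centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# toy data: two well-separated blobs of 50 points each
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, _ = kmeans_sketch(X_toy, k=2)
```

With two well-separated blobs, the loop converges with one centroid per blob, so all points of a blob end up sharing one label.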
Here is the visualization of how K-means clustering works (Source):
from IPython.display import Image
Image(filename='kmeans.png')
# import relevant libraries for clustering. We will use KMeans, AgglomerativeClustering, MeanShift, Birch, and DBSCAN
from statistics import mode
from sklearn.cluster import KMeans, AgglomerativeClustering, MeanShift, Birch, DBSCAN
from scipy.spatial.distance import cdist
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score
For K-means clustering I will use the elbow method to find the optimal number of clusters. The elbow method is a hyper-parameter tuning method that is very easy to understand and implement: the turning point, also known as the 'elbow' of a line chart, helps us identify the optimal K (the number of clusters) for a dataset, since the ideal model is one with low inertia and a low number of clusters K.
I will use the inertia attribute to find the sum of squared distances of samples to their closest cluster center. I will use the range of 1 to 10 clusters, and plot the inertia values for each number of clusters.
distortions = []
inertias = []
K = 10  # try 1 to 10 clusters

for k in range(1, K + 1):
    kmeanModel = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeanModel.fit(X)
    # distortion: mean distance of each sample to its closest cluster center
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
    # inertia: sum of squared distances of samples to their closest cluster center
    inertias.append(kmeanModel.inertia_)

evaluation = pd.DataFrame.from_records({'Cluster': np.arange(1, K + 1), 'Distortion': distortions, 'Inertia': inertias})
evaluation
evaluation
| | Cluster | Distortion | Inertia |
|---|---|---|---|
| 0 | 1 | 2.510992 | 40145.000000 |
| 1 | 2 | 2.009034 | 26149.671429 |
| 2 | 3 | 1.659346 | 18351.026057 |
| 3 | 4 | 1.545402 | 15871.472113 |
| 4 | 5 | 1.433925 | 13716.470853 |
| 5 | 6 | 1.340790 | 12219.332245 |
| 6 | 7 | 1.305025 | 11058.091019 |
| 7 | 8 | 1.254608 | 10240.145741 |
| 8 | 9 | 1.204383 | 9515.812120 |
| 9 | 10 | 1.165283 | 8870.208760 |
evaluation.plot.line(x = 'Cluster', subplots=True)
Based on the plots above, we can see that the 'elbow' of the line charts is at 3. Therefore, it is safe to assume an optimal K (number of clusters) for the fertility rate dataset would be 3.
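The visually identified elbow can also be cross-checked numerically. One simple heuristic (an illustration, not the only valid criterion) is to find where the curve bends most sharply, i.e. where the discrete second difference is largest. Using the distortion values from the table above:

```python
import numpy as np

# distortion values from the evaluation table above, for k = 1..10
distortions = np.array([2.510992, 2.009034, 1.659346, 1.545402, 1.433925,
                        1.340790, 1.305025, 1.254608, 1.204383, 1.165283])
ks = np.arange(1, 11)

# the 'elbow' is where the curve bends most sharply, i.e. where the
# discrete second difference d[k-1] - 2*d[k] + d[k+1] is largest
second_diff = distortions[:-2] - 2 * distortions[1:-1] + distortions[2:]
elbow_k = ks[1:-1][second_diff.argmax()]
print(elbow_k)  # -> 3
```

This agrees with the visual reading of K = 3.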
In this part, we will plot the data points using the original labels defined based on the fertility rate, as well as the data points using the labels from K-means clustering method.
bestK = KMeans(n_clusters=3, init='k-means++', random_state= 42)
km_labels = bestK.fit_predict(X)
df['km_labels'] = km_labels
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.scatterplot(data=df, x="GDPperCapita_USD", y="FertilityRate", hue="label", ax=ax[0])
sns.scatterplot(data=df, x="GDPperCapita_USD", y="FertilityRate", hue="km_labels", ax=ax[1])
<AxesSubplot:xlabel='GDPperCapita_USD', ylabel='FertilityRate'>
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.scatterplot(data=df, x="Human_Dev_Index", y="FertilityRate", hue="label", ax=ax[0])
sns.scatterplot(data=df, x="Human_Dev_Index", y="FertilityRate", hue="km_labels", ax=ax[1])
<AxesSubplot:xlabel='Human_Dev_Index', ylabel='FertilityRate'>
From the plots above, we can see that there might exist different kinds of clusters other than the original two groups made based on the fertility rate (above replacement & below replacement). To confirm this, more clustering methods will be applied to the fertility rate dataset.
DBSCAN, short for density-based spatial clustering of applications with noise, is a clustering method that groups data points based on their density. This method works by separating the data points into low-density and high-density regions based on the number of and distance to their neighboring data points. DBSCAN works well with data that is not linearly separable or has irregularly shaped clusters.
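This property can be demonstrated quickly on scikit-learn's make_moons toy data (a self-contained sketch, separate from the fertility dataset; the eps and min_samples values here are chosen for this toy example only):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two interleaving half-circles: not linearly separable, irregular cluster shapes
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# DBSCAN recovers the two crescents by following dense regions;
# label -1 (if present) marks noise points
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)
print(np.unique(db_labels))
```

A centroid-based method like K-means would cut straight across the crescents here, while DBSCAN follows their density and separates them.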
Here is the visualization of how DBSCAN method works and the difference between DBSCAN and other clustering methods (Source):
Image(filename='km02.png')
Image(filename='hc01.png')
Image(filename='db.png')
In this part, I will apply the DBSCAN method to dataset X and predict the labels for the clusters.
The functions below iterate over one hyper-parameter (grid search) and return the cluster result that optimizes the Silhouette score, which measures how close each point is to the other points in its own cluster compared to points in other clusters. The result will determine the value for the eps parameter in the DBSCAN function:
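Before the grid search, here is a small self-contained example of the Silhouette score itself (on synthetic blobs, not the fertility data): a clustering that matches the true structure scores close to 1, while an over-segmented one scores lower.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# toy data with 3 well-separated blobs
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

scores = {}
for k in (3, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_blobs)
    scores[k] = silhouette_score(X_blobs, labels)
print(scores)  # k=3 (matching the true structure) scores higher than k=8
```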
import sklearn.cluster
# THIS WILL ITERATE OVER ONE HYPER-PARAMETER (GRID SEARCH)
# AND RETURN THE CLUSTER RESULT THAT OPTIMIZES THE SILHOUETTE SCORE
def maximize_silhouette(X, algo="birch", nmax=20, i_plot=False):
    # PARAM
    i_print = False

    # FORCE CONTIGUOUS
    X = np.ascontiguousarray(X)

    # LOOP OVER HYPER-PARAM
    params = []; sil_scores = []
    sil_max = -10
    for param in range(2, nmax + 1):
        if algo == "birch":
            model = sklearn.cluster.Birch(n_clusters=param).fit(X)
            labels = model.predict(X)

        if algo == "ag":
            model = sklearn.cluster.AgglomerativeClustering(n_clusters=param).fit(X)
            labels = model.labels_

        if algo == "dbscan":
            param = 0.5 * (param - 1)  # for DBSCAN the hyper-parameter is eps, not n_clusters
            model = sklearn.cluster.DBSCAN(eps=param).fit(X)
            labels = model.labels_

        if algo == "kmeans":
            model = sklearn.cluster.KMeans(n_clusters=param).fit(X)
            labels = model.predict(X)

        try:
            sil_scores.append(sklearn.metrics.silhouette_score(X, labels))
            params.append(param)
        except Exception:
            # silhouette_score fails if only one cluster is found; skip this param
            continue

        if i_print: print(param, sil_scores[-1])

        if sil_scores[-1] > sil_max:
            opt_param = param
            sil_max = sil_scores[-1]
            opt_labels = labels

    print("OPTIMAL PARAMETER =", opt_param)

    if i_plot:
        fig, ax = plt.subplots()
        ax.plot(params, sil_scores, "-o")
        ax.set(xlabel='Hyper-parameter', ylabel='Silhouette')
        plt.show()

    return opt_labels
# UTILITY PLOTTING FUNCTION
def plot(X, color_vector):
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], X[:, 1], c=color_vector, cmap="viridis")  # alpha=0.5 optional
    ax.set(xlabel='Feature-1 (x_1)', ylabel='Feature-2 (x_2)', title='Cluster data')
    ax.grid()
    # fig.savefig("test.png")
    plt.show()
# DBSCAN
opt_labels=maximize_silhouette(X,algo="dbscan",nmax=15, i_plot=True)
plot(X,opt_labels)
OPTIMAL PARAMETER = 1.0
Based on the Silhouette scores generated above, we can see that the optimal value for the eps parameter would be 1.
model = DBSCAN(eps=1, min_samples=3).fit(X)
labels_DBSCAN = model.labels_
df['db_labels'] = labels_DBSCAN
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.scatterplot(data=df, x="GDPperCapita_USD", y="FertilityRate", hue="label", ax=ax[0])
sns.scatterplot(data=df, x="GDPperCapita_USD", y="FertilityRate", hue="db_labels", ax=ax[1])
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.scatterplot(data=df, x="Human_Dev_Index", y="FertilityRate", hue="label", ax=ax[0])
sns.scatterplot(data=df, x="Human_Dev_Index", y="FertilityRate", hue="db_labels", ax=ax[1])
Based on the plots above, we can see that although DBSCAN also divides the data points into two groups, one group is substantially larger than the other. In other words, the clusters are heavily imbalanced in size, which contributes very limited value for distinguishing any potential underlying patterns in the data. Therefore, DBSCAN might not be the optimal clustering method for the fertility rate dataset.
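Such cluster-size imbalance can be checked directly by counting labels. A self-contained sketch on a made-up toy label vector (in the notebook, the same call would be applied to labels_DBSCAN):

```python
import numpy as np

# toy label vector standing in for DBSCAN output (-1 marks noise points)
labels_toy = np.array([0] * 5400 + [1] * 300 + [-1] * 35)

# count how many points fall in each cluster
values, counts = np.unique(labels_toy, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))  # -> {-1: 35, 0: 5400, 1: 300}
```

A ratio like 5400:300 between the two clusters makes the grouping uninformative for pattern discovery.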
Unlike partitional clustering methods such as K-means, which divide data points into separate, non-overlapping groups, the clusters created by hierarchical clustering methods can be nested and overlapping. In other words, there exist hierarchies among the groups, so there is no fixed number k that defines the number of clusters.
There are normally two kinds of hierarchical clustering strategies:

- Agglomerative (bottom-up): each data point starts as its own cluster, and the closest clusters are merged step by step until only one cluster remains.
- Divisive (top-down): all data points start in a single cluster, which is split recursively into smaller and smaller clusters.
The clusters formed using hierarchical clustering methods are usually displayed as dendrograms. An example of a dendrogram is shown below (Source):
Image(filename='dendro.png')
For the hierarchical clustering part, I choose to perform agglomerative clustering approach on the fertility rate dataset, which means the clustering process will be implemented in a bottom-up manner.
# Perform Agglomerative Clustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering().fit(X)
labels_agg = model.labels_
# create linkage for agglomerative clustering, and the dendrogram for the linkage.
# Suggest the optimal number of clusters based on the dendrogram.
Z = linkage(X, method='ward')
dend = dendrogram(Z)
plt.axhline(y=40, color='black', linestyle='--')
Based on the dendrogram above, it is safe to assume that the optimal number of clusters could be 2 or 3. From the dendrogram we can see that there exist 3 main sub-clusters, which can be distinguished by the orange, green, and red parts of the plot. However, the data points in the fertility dataset can also be divided into 2 groups based on the blue lines in the dendrogram. This result aligns with the results from the other two clustering methods performed previously.
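Cutting the dendrogram into a chosen number of flat clusters can also be done programmatically with scipy's fcluster. A self-contained sketch on toy data (in the notebook, the same call would be applied to the linkage matrix Z):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy data: two well-separated blobs of 20 points each
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

# build the hierarchy with Ward linkage, as in the dendrogram above
Z_toy = linkage(X_toy, method='ward')

# cut the tree into a fixed number of flat clusters
labels_2 = fcluster(Z_toy, t=2, criterion='maxclust')
print(np.unique(labels_2))  # -> [1 2]
```

Passing t=3 with the same criterion would instead yield the three-cluster cut discussed above.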
Based on the outcomes of all three clustering methods (K-means, DBSCAN, hierarchical) used in the previous parts, K-means clustering and agglomerative clustering work better than DBSCAN. However, all of the clustering methods provide insightful results. For example, the result of K-means clustering suggests that different kinds of labels may exist; in other words, there are criteria other than the existing label (Below Replacement or Above Replacement, created based on the fertility rate of a country/region) by which the data points can be grouped. In such cases, different or additional features are often used for the clustering process.